Abstract: Predicting the future location of users in wireless networks
has numerous applications, and can help service providers
to improve the quality of service perceived by their clients. The 
location predictors proposed so far estimate the next location of 
a specific user by inspecting the past individual trajectories of 
this user. As a consequence, when the training data collected for 
a given user is limited, the resulting prediction is inaccurate. 
In this paper, we develop cluster-aided predictors that exploit 
past trajectories collected from all users to predict the next 
location of a given user. These predictors rely on clustering 
techniques and extract from the training data similarities among 
the mobility patterns of the various users to improve the prediction 
accuracy. Specifically, we present CAMP (Cluster-Aided Mobility 
Predictor), a cluster-aided predictor whose design is based on 
recent non-parametric Bayesian statistical tools. CAMP is robust
and adaptive in the sense that it exploits similarities in users’ 
mobility only if such similarities are really present in the training 
data. We analytically prove the consistency of the predictions 
provided by CAMP, and investigate its performance using two 
large-scale datasets. CAMP significantly outperforms existing 
predictors, and in particular those that only exploit individual 
past trajectories. 

I. Introduction 

Predicting users’ mobility in wireless networks has received 
a great deal of attention recently, strongly motivated by a wide 
range of applications. Examples of such applications include: 
location-based services provided to users by anticipating their 
movements (e.g., mobile advertisement, recommendation systems,
risk alarm); urban traffic engineering and forecasting;
the design of more efficient radio resource allocation protocols 
(e.g., scheduling and handover management 12, data prefetching
121 and energy-efficient location sensing 13]). However, for
these applications to significantly benefit from users’ mobility 
predictions, the latter should be made with a sufficiently high 
degree of accuracy. 

Many mobility prediction methods and algorithms have been 
devised over the last decade, see e.g. 0-0. The algorithms 
proposed so far estimate the next location of a specific user 
by inspecting the data available about her past mobility, i.e., 
her past trajectory, and exploit the inherent repeated patterns 
present in this data. These patterns correspond to the regular 
behavior of the user, e.g. commuting from home to work or 
visiting favourite restaurants, and need to be extracted from 
the data to provide accurate predictions. To this aim, one has 
to observe the behavior of the user over long periods of time. 
Unfortunately, gathering data about users’ mobility can be quite 
challenging. For instance, detecting the current location of a 
user with sensors (e.g., GPS, Wi-Fi and cell tower) consumes
a non-negligible amount of energy. Users may also hesitate to log their
trajectories to preserve their privacy. In any case, when the 
data about the mobility of a given user is limited, it is hard to 
identify her typical mobility patterns, and in turn difficult to 
provide accurate predictions on her next move or location. 

In this paper, we aim at devising mobility predictors that 
perform well even if the past trajectories gathered for the 
various users are short. Our main idea is to develop cluster- 
aided predictors that exploit the data (i.e., past trajectories) 
collected from all users to predict the next location of a given 
user. These predictors rely on clustering techniques and extract 
from the training data similarities among the mobility patterns 
of the various users to improve the prediction accuracy. More 
precisely, we make the following contributions: 

• We present CAMP (Cluster-Aided Mobility Predictor), a 
cluster-aided predictor whose design is based on recent 
non-parametric Bayesian statistical tools 0, ®. CAMP
extracts, from the data, clusters of users with similar
mobility processes, and exploits this clustered structure to
provide accurate mobility predictions. The use of non- 
parametric statistical tools allows us to adapt the number 
of extracted clusters to the training data (this number can 
actually grow with the data, i.e., with the number of users). 
This confers to our algorithm a strong robustness, i.e., 
CAMP exploits similarities in users’ mobility only if such 
similarities are really present in the training data. 

• We derive theoretical performance guarantees for the predictions
made under the CAMP algorithm. In particular,
we show that CAMP can achieve the performance of an 
optimal predictor (among the set of all predictors) when 
the number of users grows large, and for a large class of 
mobility models. 

• Finally, we compare the performance of our predictor 
to that of other existing predictors using two large-scale 
mobility datasets (corresponding to a Wi-Fi and a cellular 
network, respectively). CAMP significantly outperforms 
existing predictors, and in particular those that only exploit 
individual past trajectories to estimate users’ next location. 

II. Related work 

Most existing mobility prediction methods estimate the
next location of a specific user by inspecting the past individual 
trajectories of this user. One of the most popular mobility 
predictors consists in modelling the user trajectory as an order-$k$
Markov chain. Predictors based on the order-$k$ Markov model
are asymptotically optimal 0, 0 for a large class of mobility 
models. This optimality only holds asymptotically when the 
length of the observed user past trajectory tends to infinity. 
Unfortunately, when the observed past trajectory of the user is
rather short, these predictors perform poorly. This phenomenon
is often referred to as the “cold-start problem”. To improve the
performance of these predictors for short histories, a fallback 
mechanism can be added a to reduce the order of the Markov 
model when the current sequence of k previous locations has 
not been encountered before. Alternatively, one may adapt 
the order of the Markov model used for prediction as in the 
Sampled Pattern Matching (SPM) algorithm 0, which sets 
the order of the Markov model to a fraction of the longest 
suffix match in the history. SPM is asymptotically optimal with 
provable bounds on its rate of convergence, when the trajectory 
is generated by a stationary mixing source. Another type of 
mobility predictor, NextPlace El, attempts to leverage the timestamps
that may be associated with the successive locations
visited by the user. Empirical evaluations El, El show that 
complex mobility models do not perform well: the order-2 
Markov predictor with fallback gives comparable performance 
to that of SPM El, NextPlace El and higher order Markov 
predictors. In addition El reports that the order-1 Markov 
predictor can actually provide better predictions than higher 
order Markov predictors, as the latter suffer more from the 
lack of training data. 
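
To make the fallback mechanism concrete, the following is a minimal sketch of an order-$k$ Markov predictor that falls back to shorter contexts when the current context has never been observed. The function names `train_markov` and `predict_with_fallback` are illustrative and do not come from the cited works; ties between equally frequent next locations are resolved arbitrarily.

```python
from collections import Counter, defaultdict


def train_markov(trajectory, max_order):
    """Count next-location frequencies for every context of length 0..max_order."""
    counts = defaultdict(Counter)
    for t in range(1, len(trajectory)):
        for k in range(max_order + 1):
            if k > t:  # not enough history for a context of length k
                break
            context = tuple(trajectory[t - k:t])
            counts[context][trajectory[t]] += 1
    return counts


def predict_with_fallback(counts, history, max_order):
    """Predict with the longest context seen in training, else a shorter one."""
    for k in range(min(max_order, len(history)), -1, -1):
        context = tuple(history[len(history) - k:])
        seen = counts.get(context)
        if seen:
            return seen.most_common(1)[0][0]
    return None  # no training data at all
```

With `max_order = 2`, this corresponds to the order-2 predictor with fallback discussed above: the order-2 context is tried first, then the order-1 context, and finally the overall most frequent location.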

There have been a few papers aiming at clustering trajectories
or more generally stochastic processes. For example,
El proposes algorithms to find clusters of trajectories based 
on likelihood maximization for an underlying hidden Markov 
model. For the same problem, ED uses spectral clustering 
in a semi-parametric manner based on Bhattacharyya affinity 
metric between pairs of trajectories. Those methods would
not work well in our setting, because they require that
(i) users belonging to the same cluster have trajectories
generated by identical parameters, and (ii) the number of
clusters is known beforehand, or can be estimated in a reliable
way. The non-parametric Bayesian approach developed in this
paper addresses both issues. E2 also introduced a Bayesian
approach that focuses on the similarity between users’ temporal
patterns, but it does not consider the similarity between spatial
trajectories or the correlation with the recent locations, which are
crucial for correct predictions in our setting.

III. Models and Objectives 

In this section, we first describe the data on past user 
trajectories available at a given time to build predictors. We 
then provide a model for user mobility, used to define our 
non-parametric inference approach, as well as its objectives. 

A. Collected Data 

We consider the problem of predicting at a given time the
mobility, i.e., the next position of users, based on observations
of users’ past trajectories. These observations are collected
and stored on a server. The set of users is denoted by $\mathcal{U}$,
and users are all moving within a common finite set $\mathcal{L}$ of
$L$ locations. The trajectory collected for user $u$ is denoted
by $x^u = (x^u_1, \ldots, x^u_{n^u})$, where $x^u_t$ corresponds to the $t$-th
location visited by user $u$, and where $n^u$ refers to the length
of the trajectory; $x^u_{n^u}$ denotes the current location of user $u$.
By definition, we impose $x^u_t \neq x^u_{t+1}$, i.e., two consecutive
locations on a trajectory must be different. Let $x^{\mathcal{U}} = (x^u)_{u \in \mathcal{U}}$
denote the set of user trajectories. Observe that the lengths of
the trajectories may vary across users. If the location of a user
is sensed periodically, we can collect the time a given user
has stayed at each location. Those staying times for user $u$
are denoted by $s^u = (s^u_1, \ldots, s^u_{n^u})$, where $s^u_t$ is the staying
time at the $t$-th visited location. To simplify the presentation,
we present our prediction methods ignoring the staying times
$s^u$, but we mention how to extend our approach to include
staying times in §IV-B4.

Next we introduce additional notation. We denote by $n^u_{i,j}$
the number of observed transitions for user $u$ from location $i$
to $j$, i.e., $n^u_{i,j} = \sum_{t=1}^{n^u-1} \mathbb{1}(x^u_t = i, x^u_{t+1} = j)$. Similarly,
$n^u_i = \sum_{t=1}^{n^u} \mathbb{1}(x^u_t = i)$ is the number of times user $u$ has been
observed at location $i$. Let $\mathcal{H} \subset \cup_{n=0}^{\infty} \mathcal{L}^n$ denote the set of all
possible trajectories of a given user, and let $\mathcal{H}^{\mathcal{U}}$ be the set of
all possible sets of trajectories of users in $\mathcal{U}$.
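
As an illustration of this notation, the counts of observed transitions and of visits for a single user can be computed in one pass over her trajectory. This is a minimal sketch; the function name `transition_counts` is illustrative.

```python
from collections import Counter


def transition_counts(trajectory):
    """Return (pair_counts, visit_counts) for one user's trajectory.

    pair_counts[(i, j)] is the number of observed transitions from i to j,
    visit_counts[i] is the number of times the user was observed at i.
    """
    pair_counts = Counter(zip(trajectory, trajectory[1:]))
    visit_counts = Counter(trajectory)
    return pair_counts, visit_counts
```

For the trajectory `["a", "b", "a", "c", "a", "b"]`, the transition from `a` to `b` is counted twice and location `a` is counted three times.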

B. Mobility Models 

The design of our predictors is based on a simple mobility 
model. We assume that user trajectories are order-1 Markov 
chains, with arbitrary initial state or location. More precisely, 
user-$u$’s trajectory is generated by the transition kernel $\theta^u =
(\theta^u_{i,j})_{i,j \in \mathcal{L}} \in [0,1]^{L \times L}$, where $\theta^u_{i,j}$ denotes the probability that
user $u$ moves from location $i$ to $j$ along her trajectory. Hence,
given her initial position $x^u_1$, the probability of observing
trajectory $x^u$ is $P_{\theta^u}(x^u) := \prod_{t=1}^{n^u-1} \theta^u_{x^u_t, x^u_{t+1}}$. Our mobility
model can be readily extended to order-$k$ Markov chains.
However, as observed in El, the order-1 Markov chain model
already provides reasonably accurate predictions in practice,
and higher-order models would require a fallback mechanism$^1$
El. Throughout the paper, we use uppercase letters to represent
random variables and the corresponding lowercase letters for
their realizations, e.g. $X^u$ (resp. $x^u$) denotes the random (resp.
realization of) trajectory of user $u$.
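
The likelihood of a trajectory under this order-1 Markov model is a product of transition probabilities, which is more safely evaluated in log-space. The sketch below assumes, for illustration, that the kernel is stored as a nested dictionary `kernel[i][j]`; this representation and the name `log_likelihood` are not part of the paper.

```python
import math


def log_likelihood(trajectory, kernel):
    """Log-probability of a trajectory given its first location.

    kernel[i][j] is the probability of moving from location i to location j.
    """
    total = 0.0
    for i, j in zip(trajectory, trajectory[1:]):
        p = kernel[i][j]
        if p <= 0.0:
            return float("-inf")  # impossible transition under this kernel
        total += math.log(p)
    return total
```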

C. Bayesian Framework, Clusters, and Objectives 

We adopt a Bayesian framework, and assume that the transition
kernels of the various users are drawn independently
from the same distribution $\mu \in \mathcal{P}(\Theta)$,$^2$ referred to as the
prior distribution over the set of all possible transition kernels
$\Theta$. This assumption is justified by De Finetti’s theorem (see
ED, Theorem 11.10) if $(\theta^u)_{u \in \mathcal{U}}$ are exchangeable (which is
typically the case if users are a priori indistinguishable). In the
following, the expectation and probability under $\mu$ are denoted
by $\mathbb{E}$ and $\mathbb{P}$, respectively. To summarize, the trajectories of
users are generated using the following hierarchical model: for
all $u \in \mathcal{U}$, $\theta^u \sim \mu$, $X^u \sim P_{\theta^u}$, and $n^u$, $X^u_1$ are arbitrarily
fixed.

$^1$ To accurately predict the next position of user $u$ given that the sequence
of her past $k$ positions is $i_1, \ldots, i_k$, her trajectory should contain numerous
instances of this sequence, which typically does not occur if the observed
trajectory is short, and this is precisely the case we are interested in.

$^2$ $\mathcal{P}(\mathcal{M})$ denotes the set of distributions over the set $\mathcal{M}$, and $\Theta = \{\theta \in
[0,1]^{L \times L} : \forall i, \sum_j \theta_{i,j} = 1\}$.

To provide accurate predictions even if observed trajectories 
are rather short, we leverage similarities among user mobility 
patterns. It seems reasonable to think that the trajectories of 
some users are generated through similar transition kernels. 
In other words, the distribution $\mu$ might exhibit a clustered
structure, putting mass around a few typical transition kernels. 
Our predictors will identify these clusters, and exploit this 
structure, i.e., to predict the next location of a user u, we shall 
leverage the observed trajectories of all users who belong to 
user-u’s cluster. 

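For any user $u$, we aim at proposing an accurate predictor
$\hat{x}^u \in \mathcal{L}$ of her next location, given the observed trajectories
$X^{\mathcal{U}} = x^{\mathcal{U}}$ of all users. The (Bayesian) accuracy of a predictor
$\hat{x}^u$ for user $u$, denoted by $\pi^u(\hat{x}^u)$, is defined as $\pi^u(\hat{x}^u) :=
\mathbb{P}(X^u_{n^u+1} = \hat{x}^u \mid x^{\mathcal{U}}) = \mathbb{E}[\theta^u_{x^u_{n^u}, \hat{x}^u} \mid x^{\mathcal{U}}]$ (where for conciseness,
we write $\mathbb{P}(\cdot \mid x^{\mathcal{U}}) = \mathbb{P}(\cdot \mid X^{\mathcal{U}} = x^{\mathcal{U}})$). Clearly, given $X^{\mathcal{U}} =
x^{\mathcal{U}}$, the best possible predictor would be:

$$\hat{x}^u \in \arg\max_{j \in \mathcal{L}} \mathbb{E}[\theta^u_{x^u_{n^u}, j} \mid x^{\mathcal{U}}]. \tag{1}$$

Computing this optimal predictor, referred to as the Bayesian
predictor with prior $\mu$, requires the knowledge of $\mu$. Indeed:

$$\mathbb{E}[\theta^u_{x^u_{n^u}, j} \mid x^{\mathcal{U}}] = \frac{\int_\Theta \theta_{x^u_{n^u}, j}\, P_\theta(x^u)\, \mu(d\theta)}{\int_\Theta P_\theta(x^u)\, \mu(d\theta)}. \tag{2}$$

Since here the prior distribution $\mu$ is unknown, we will first
estimate $\mu$ from the data, and then construct our predictor
according to (1)-(2).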
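
To illustrate how the Bayesian predictor behaves when the prior is known, consider the special case, assumed here purely for illustration, where the prior is uniform over all transition kernels (each row is Dirichlet(1, ..., 1)). The posterior mean of the row of the current location then has the closed form $(1 + n_{i,j}) / (L + \sum_{j'} n_{i,j'})$, and the predictor is its argmax. The function names below are hypothetical.

```python
from collections import Counter


def posterior_mean_row(trajectory, locations):
    """Posterior mean of the transition row of the current location,
    assuming a uniform (Dirichlet(1,...,1)) prior on that row."""
    current = trajectory[-1]
    counts = Counter(j for i, j in zip(trajectory, trajectory[1:]) if i == current)
    n_out = sum(counts.values())  # transitions observed out of `current`
    num_locations = len(locations)
    return {j: (1 + counts[j]) / (num_locations + n_out) for j in locations}


def bayes_predict(trajectory, locations):
    """Next-location prediction: argmax of the posterior mean row."""
    row = posterior_mean_row(trajectory, locations)
    return max(locations, key=lambda j: row[j])
```

With a short trajectory the posterior mean stays close to uniform, which is the single-user limitation that motivates pooling data across similar users.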


IV. Bayesian Non-parametric Inference 

In view of the model described in the previous section, we 
can devise an accurate mobility predictor if we are able to 
provide a good approximation of the prior distribution $\mu$ on the
transition kernels dictating the mobility of the various users.
If $\mu$ concentrates its mass around a few typical kernels that
would in turn define clusters of users (i.e., users with similar 
mobility patterns), we would like to devise an inference method 
identifying these clusters. On the other hand, our inference 
method should not discover clusters if there are none, nor 
specify in advance the number of clusters (as in the traditional 
mixture modelling approach). Towards these objectives, we 
apply a Bayesian non-parametric approach that estimates how 
many clusters are needed to model the observed data and also 
allows the number of clusters to grow with the size of the 
data. In Bayesian non-parametric approaches, the complexity 
of the model (here the number of clusters) is part of the 
posterior distribution, and is allowed to grow with the data, 
which confers flexibility and robustness to these approaches. 
In the remainder of this section, we first present an overview
of the Dirichlet Process mixture model, a particular Bayesian 
non-parametric model, and then apply this model to the design 
of CAMP (Cluster-Aided Mobility Predictor), a robust and 
flexible prediction algorithm that efficiently exploits similarities 
in users’ mobility, if any exist. 


A. Dirichlet Process Mixture Model 


When applying Bayesian non-parametric inference techniques
0 to our prediction problem, we add one level of
randomness. More precisely, we approximate the prior distribution
$\mu$ on the transition kernels $\theta^u$ by a random variable
$\hat{\mu}$ with distribution $g \in \mathcal{P}(\mathcal{P}(\Theta))$. This additional level of
randomness allows us to introduce some flexibility in the
number of clusters present in $\hat{\mu}$. We shall compute the posterior
distribution of $g$ given the observations $x^{\mathcal{U}}$, and hope that this
posterior distribution, denoted as $g \mid x^{\mathcal{U}}$, will concentrate its
mass around the true prior distribution $\mu$. To evaluate $g \mid x^{\mathcal{U}}$,
we use Gibbs sampling techniques (see Section IV-B1), and
from these samples, we shall estimate the true prior $\mu$, and
derive our predictor by replacing $\mu$ by its estimate in (1)-(2).

For the higher-level distribution $g$, we use the Dirichlet
Process (DP) mixture model, a standard choice of prior
over infinite dimensional spaces, such as $\mathcal{P}(\Theta)$. The DP
mixture model has a possibly infinite number of mixture
components or clusters, and is defined by a concentration
parameter $\alpha > 0$, which impacts the number of clusters,
and a base distribution $G_0 \in \mathcal{P}(\Theta)$, from which new clusters
are drawn. The DP mixture model with parameters $\alpha$
and $G_0$ is denoted by $DP(\alpha, G_0)$ and defined as follows.
If $\nu$ is a random measure drawn from $DP(\alpha, G_0)$ (i.e.,
$\nu \sim DP(\alpha, G_0)$), and $\{A_1, A_2, \ldots, A_K\}$ is a (measurable)
partition of $\Theta$, then $(\nu(A_1), \ldots, \nu(A_K))$ follows a Dirichlet
distribution with parameters $(\alpha G_0(A_1), \ldots, \alpha G_0(A_K))$.$^3$ It is
well known Cl that a sample $\nu$ from $DP(\alpha, G_0)$ has the
form $\nu = \sum_{c=1}^{\infty} \beta_c \delta_{\theta^c}$, where $\delta_\theta$ is the Dirac measure at point
$\theta \in \Theta$, the $\theta^c$’s are i.i.d. with distribution $G_0$ and represent the
centres of the clusters (indexed by $c$), and the weights $\beta_c$’s are
generated using a Beta distribution according to the following
stick-breaking construction:

$$\tilde{\beta}_c \sim \mathrm{Beta}(1, \alpha) \ \text{(the $\tilde{\beta}_c$’s are independent)}, \qquad \beta_c = \tilde{\beta}_c \prod_{i=1}^{c-1} (1 - \tilde{\beta}_i).$$
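
The stick-breaking construction can be simulated directly. The sketch below truncates the infinite sequence of weights after `num_weights` terms and also returns the mass left on the unbroken part of the stick; the truncation and the function name are illustrative choices, not part of the model.

```python
import random


def stick_breaking(alpha, num_weights, rng):
    """Draw the first `num_weights` stick-breaking weights for concentration alpha.

    Returns (weights, remaining), where `remaining` is the mass not yet assigned.
    """
    weights = []
    remaining = 1.0
    for _ in range(num_weights):
        frac = rng.betavariate(1.0, alpha)  # independent Beta(1, alpha) draw
        weights.append(remaining * frac)    # break off a fraction of what is left
        remaining *= 1.0 - frac
    return weights, remaining
```

A small `alpha` tends to put most of the mass on the first few weights (few clusters), while a large `alpha` spreads the mass over many weights.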

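When $(\theta^u)_{u \in \mathcal{U}}$ is generated under the above DP mixture
model, we can compute the distribution of $\theta^u$ given $\theta^{\mathcal{U} \setminus u} =
(\theta^v)_{v \in \mathcal{U} \setminus \{u\}}$. When $\theta^{\mathcal{U} \setminus u}$ is fixed, then users in $\mathcal{U} \setminus \{u\}$ are
clustered and the set of corresponding clusters is denoted by
$c^{\mathcal{U} \setminus u}$. Users in cluster $c \in c^{\mathcal{U} \setminus u}$ share the same transition
kernel $\theta^c$, and the number of users assigned to cluster $c$ is
denoted by $n_{c,-u} = \sum_{v \in \mathcal{U} \setminus \{u\}} \mathbb{1}(v \in c)$. The distribution of $\theta^u$
given $\theta^{\mathcal{U} \setminus u}$ is then:

$$\theta^u \mid \theta^{\mathcal{U} \setminus u} \sim \begin{cases} G_0 & \text{w.p. } \dfrac{\alpha}{\alpha + |\mathcal{U}| - 1}, \\[2mm] \delta_{\theta^c} & \text{w.p. } \dfrac{n_{c,-u}}{\alpha + |\mathcal{U}| - 1}, \quad \forall c \in c^{\mathcal{U} \setminus u}. \end{cases} \tag{3}$$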


Equation (3) makes the cluster structure of the DP mixture model
explicit. Indeed, when considering a new user $u$, a new cluster
containing user $u$ only is created with probability $\frac{\alpha}{\alpha + |\mathcal{U}| - 1}$, and
user $u$ is associated with an existing cluster $c$ with probability
proportional to the number of users already assigned to this
cluster. Refer to m for a more detailed description of DP
mixture models.

$^3$ The Dirichlet distribution with parameters $(\alpha_1, \ldots, \alpha_K)$ has density
(with respect to Lebesgue measure) proportional to $\mathbb{1}(x_1 > 0, \ldots, x_K > 0)\,
\mathbb{1}(x_1 + \cdots + x_K = 1) \prod_{k=1}^{K} x_k^{\alpha_k - 1}$.
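
The prior assignment probabilities of a user, given the cluster sizes of all other users, can be computed as follows. This is a minimal sketch of the prior part only (no data likelihood); the function name `crp_probabilities` is illustrative, and `cluster_sizes` is assumed to count every user except the one being assigned.

```python
def crp_probabilities(cluster_sizes, alpha):
    """Prior probabilities of opening a new cluster or joining each existing one.

    cluster_sizes: sizes of the existing clusters, excluding the user considered.
    Returns (p_new, p_existing) with p_new + sum(p_existing) == 1.
    """
    total = alpha + sum(cluster_sizes)  # alpha + (number of other users)
    p_new = alpha / total
    p_existing = [n / total for n in cluster_sizes]
    return p_new, p_existing
```

For example, with `alpha = 1` and two existing clusters of sizes 3 and 1, a new cluster is opened with probability 0.2 and the larger cluster is joined with probability 0.6.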

Our prediction method simply consists in approximating
$\mathbb{E}[\theta^u \mid x^{\mathcal{U}}]$ by the expectation w.r.t. the posterior distribution
$g \mid x^{\mathcal{U}}$. In other words, for user $u$, the estimated next position
will be:

$$\hat{x}^u \in \arg\max_{j \in \mathcal{L}} \mathbb{E}_g[\theta^u_{x^u_{n^u}, j} \mid x^{\mathcal{U}}], \tag{4}$$

where $\mathbb{E}_g[\cdot]$ denotes the expectation w.r.t. the probability measure
induced by $g$. To compute $\mathbb{E}_g[\theta^u \mid x^{\mathcal{U}}]$, we rely on Gibbs
sampling techniques to generate samples with distribution
$g \mid x^{\mathcal{U}}$. The way $g \mid x^{\mathcal{U}}$ concentrates its mass around the true prior
$\mu$ will depend on the choice of parameters $\alpha$ and $G_0$, and to
improve the accuracy of our predictor, these parameters will
be constantly updated when successive samples are produced.

B. CAMP: Cluster-Aided Mobility Predictor 

Next we present CAMP, our mobility prediction algorithm.
The objective of this algorithm is to estimate $\mathbb{E}_g[\theta^u \mid x^{\mathcal{U}}]$, from
which we derive the predictions according to (4). CAMP
consists in generating independent samples of the assignment
of users to clusters induced by the posterior distribution $g \mid x^{\mathcal{U}}$,
and then in providing an estimate of $\mathbb{E}_g[\theta^u \mid x^{\mathcal{U}}]$ from these
samples. As mentioned above, the accuracy of this estimate
strongly depends on the choice of parameters $\alpha$ and $G_0$ in the
DP mixture model, and these parameters will be updated as
new samples are generated.

More precisely, the CAMP algorithm consists of two steps.
(i) In the first step, we use a Gibbs sampler to generate $B$ samples
of the assignment of users to clusters under the probability
measure induced by $g \mid x^{\mathcal{U}}$, and update the parameters $\alpha$ and
$G_0$ of the DP mixture model using these samples (hence
we update the prior distribution $g$). We repeat this procedure
$K - 1$ times. In the $k$-th iteration, we construct $B$ samples of
users’ assignment. The $b$-th assignment sample is referred to as
$c^{\mathcal{U},b,k} = (c^{u,b,k})_{u \in \mathcal{U}}$ in the CAMP pseudo-code, where $c^{u,b,k}$ is the
cluster of user $u$ in that sample. The subroutines providing the
assignment samples and updating the parameters of the prior
distribution $g$ are described in detail in §IV-B1 and §IV-B2,
respectively. At the end of the first step, we have constructed
a prior distribution $g$ parametrized by $G_0^K$ and $\alpha_K$ which is
adapted to the data, i.e., a distribution that concentrates its
mass on the true prior $\mu$. (ii) In the second step, we use the
updated prior $g$ to generate one last time $B$ samples of users’
assignment. Using these samples, we compute an estimate $\hat{\theta}^u$
of $\mathbb{E}_g[\theta^u \mid x^{\mathcal{U}}]$ for each user $u$, and finally derive the prediction
$\hat{x}^u$ of the next position of user $u$. The way we compute $\hat{\theta}^u$ is
detailed in §IV-B3.

The CAMP algorithm takes as inputs the data $x^{\mathcal{U}}$, the
number $K$ of updates of the prior distribution $g$, the number of
samples $B$ generated by the Gibbs sampler in each iteration,
and the number of times $M$ the users’ assignment is updated
when producing a single assignment sample using the Gibbs
sampler (under the Gibbs sampler, the assignment is a Markov
chain, which we simulate long enough so that it has the desired
distribution). $K$, $B$, and $M$ should be chosen as large as
possible. Of course, increasing these parameters also increases
the complexity of the algorithm, and we may wish to select the
parameters so as to achieve an appropriate trade-off between
accuracy and complexity.

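Algorithm 1 CAMP
Input: $x^{\mathcal{U}}$, $K$, $B$, $M$
Step 1: Updates of $G_0$ and $\alpha$
$G_0^1 \leftarrow \mathrm{Uniform}(\Theta)$, $\alpha_1 \leftarrow 1$
for $k = 1, \ldots, K - 1$ do
    for $b = 1, \ldots, B$ do
        $c^{\mathcal{U},b,k} \leftarrow$ GibbsSampler($x^{\mathcal{U}}$, $G_0^k$, $\alpha_k$, $M$)
    end
    $G_0^{k+1}, \alpha_{k+1} \leftarrow$ UpdateDP($x^{\mathcal{U}}$, $G_0^k$, $\{c^{\mathcal{U},b,k}\}_{b=1,\ldots,B}$)
end
Step 2: Last sampling and prediction
for $b = 1, \ldots, B$ do
    $c^{\mathcal{U},b,K} \leftarrow$ GibbsSampler($x^{\mathcal{U}}$, $G_0^K$, $\alpha_K$, $M$)
end
Compute $\hat{\theta}^u$ by implementing (8) using $\{c^{\mathcal{U},b,K}\}_{b=1,\ldots,B}$ and $G_0^K$
$\hat{x}^u = \arg\max_j \hat{\theta}^u_{x^u_{n^u}, j}$
Output: $\hat{\theta}^u$, $\hat{x}^u$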

1) Sampling from the DP mixture posterior: We use a Gibbs
sampler [16] to generate independent samples of the assignment
of users to clusters under the probability measure induced
by the posterior $g \mid x^{\mathcal{U}}$, i.e., samples of assignment with distribution
$\mathbb{P}_g[c^{\mathcal{U}} \mid x^{\mathcal{U}}]$, where $\mathbb{P}_g$ denotes the probability measure
induced by $g$. Gibbs sampling is a classical MCMC method
to generate samples from a given distribution. It consists in
constructing and simulating a Markov chain whose stationary
state has the desired distribution. In our case, the state of the
Markov chain is the assignment $c^{\mathcal{U}}$, and its stationary distribution
is $\mathbb{P}_g[c^{\mathcal{U}} \mid x^{\mathcal{U}}]$. The Markov chain should be simulated long
enough (here the number of steps is denoted by $M$) so that at
the end of the simulation, the state of the Markov chain has
converged to the steady-state. The pseudo-code of the proposed
Gibbs sampler is provided in Algorithm 2 and easily follows
from the description of the DP mixture model provided in (3).

To produce a sample of the assignment of users to clusters,
we proceed as follows. Initially, we group all users in the same
cluster $c_1$, the number of clusters $N$ is set to 1, and the number
of users (except for user $u$) $n_{c_1,-u}$ assigned to cluster $c_1$ is
$|\mathcal{U}| - 1$ (see Algorithm 2). Then the assignment is revised $M$
times. In each iteration, each user is considered and assigned
to either an existing cluster, or to a newly created cluster (the
latter is denoted by $c_{N+1}$ if in the previous iteration there were
$N$ clusters). This assignment is made randomly according to
the model described in (3). Note that in the definition of $\beta_c$,
we have $G_0(d\theta \mid x^c) = \frac{P_\theta(x^c)\, G_0(d\theta)}{\int_\Theta P_\theta(x^c)\, G_0(d\theta)}$, where $x^c$ corresponds
to the data of users in cluster $c$, i.e., $x^c = (x^u)_{u \in c}$.
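Algorithm 2 GibbsSampler
Input: $x^{\mathcal{U}}$, $G_0$, $\alpha$, $M$
$\forall u \in \mathcal{U}$, $c^u \leftarrow c_1$; $n_{c_1,-u} \leftarrow |\mathcal{U}| - 1$; $N \leftarrow 1$; $c^{\mathcal{U}} = \{c_1\}$.
for $i = 1, \ldots, M$ do
    for each $u \in \mathcal{U}$ do
        $c^u \leftarrow c^u \setminus \{u\}$
        $\beta_{new} \leftarrow \frac{1}{z} \frac{\alpha}{\alpha + |\mathcal{U}| - 1} \int_\Theta P_\theta(x^u)\, G_0(d\theta)$
        $\beta_c \leftarrow \frac{1}{z} \frac{n_{c,-u}}{\alpha + |\mathcal{U}| - 1} \int_\Theta P_\theta(x^u)\, G_0(d\theta \mid x^c)$, $\forall c \in c^{\mathcal{U} \setminus u}$
        In the above expressions, $z$ is a normalizing constant,
        i.e., selected so that $\beta_{new} + \sum_{c \in c^{\mathcal{U} \setminus u}} \beta_c = 1$;
        With probability $\beta_{new}$ do:
            $c_{N+1} \leftarrow \{u\}$; $c^u \leftarrow c_{N+1}$; $n_{c_{N+1},-u} \leftarrow 0$;
            $n_{c_{N+1},-v} \leftarrow 1$, $\forall v \neq u$; $c^{\mathcal{U}} \leftarrow c^{\mathcal{U}} \cup \{c_{N+1}\}$;
            $N \leftarrow N + 1$;
        and with probability $\beta_c$ do:
            $c^u \leftarrow c$; $c \leftarrow c \cup \{u\}$; $n_{c,-v} \leftarrow n_{c,-v} + 1$, $\forall v \neq u$.
    end
end
Output: $c^{\mathcal{U}}$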
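
The sampler can be sketched in a few dozen lines under one simplifying assumption that is ours, not the paper's general setting: the base distribution is taken to be uniform over transition kernels, so that the integrals of the likelihood have a closed form through Gamma functions. The names `log_marginal` and `gibbs_sampler` are illustrative, and the common factor $1/(\alpha + |\mathcal{U}| - 1)$ is dropped because it cancels in the normalization.

```python
import math
import random
from collections import Counter


def pair_counts(traj):
    """Transition counts (i, j) -> number of observed moves from i to j."""
    return Counter(zip(traj, traj[1:]))


def log_marginal(counts, num_locations):
    """log of the likelihood of `counts` integrated over a uniform kernel prior."""
    row_totals = Counter()
    total = 0.0
    for (i, _), n in counts.items():
        row_totals[i] += n
        total += math.lgamma(1 + n)
    for n_i in row_totals.values():
        total += math.lgamma(num_locations) - math.lgamma(num_locations + n_i)
    return total


def gibbs_sampler(trajectories, num_locations, alpha, sweeps, rng):
    """Return one sampled assignment {user: cluster label} after `sweeps` passes."""
    users = list(trajectories)
    user_counts = {u: pair_counts(trajectories[u]) for u in users}
    assignment = {u: 0 for u in users}  # all users start in a single cluster
    cluster_counts = {0: sum(user_counts.values(), Counter())}
    cluster_sizes = {0: len(users)}
    next_id = 1
    for _ in range(sweeps):
        for u in users:
            # Remove user u from her current cluster.
            old = assignment[u]
            cluster_counts[old] = cluster_counts[old] - user_counts[u]
            cluster_sizes[old] -= 1
            if cluster_sizes[old] == 0:
                del cluster_counts[old], cluster_sizes[old]
            # Unnormalized log-weights: existing clusters, then a new cluster.
            labels, log_w = [], []
            for c, counts in cluster_counts.items():
                gain = (log_marginal(counts + user_counts[u], num_locations)
                        - log_marginal(counts, num_locations))
                labels.append(c)
                log_w.append(math.log(cluster_sizes[c]) + gain)
            labels.append(next_id)
            log_w.append(math.log(alpha) + log_marginal(user_counts[u], num_locations))
            top = max(log_w)
            weights = [math.exp(w - top) for w in log_w]
            new = rng.choices(labels, weights=weights)[0]
            if new == next_id:
                cluster_counts[new] = Counter()
                cluster_sizes[new] = 0
                next_id += 1
            cluster_counts[new] = cluster_counts[new] + user_counts[u]
            cluster_sizes[new] += 1
            assignment[u] = new
    return assignment
```

Calling `gibbs_sampler` repeatedly with independent random generators yields the assignment samples used by the rest of the algorithm; working in log-space avoids underflow when trajectories are long.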


2) Updates of $G_0$ and $\alpha$: As in any Bayesian inference
method, our prediction method could suffer from a bad choice
of the parameters $\alpha$ and $G_0$ defining the prior $g$. For example, by
choosing a small value for $\alpha$, we tend to get a very small
number of clusters, and possibly only one cluster. On the
contrary, selecting too large an $\alpha$ would result in too many
clusters, and in turn, would make our algorithm
unable to capture similarities in the mobility patterns of the
various users. To circumvent this issue, we update and fit the
parameters to the data, as suggested in ED. In the CAMP
algorithm, the initial base distribution is uniform over all
transition kernels (over $\Theta$) and $\alpha$ is taken equal to 1. Then
after each iteration, we exploit the samples of assignments of
users to clusters to update these initial parameters, by refining
our estimates of $G_0$ and $\alpha$.


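Algorithm 3 UpdateDP at the $k$-th iteration
Input: $x^{\mathcal{U}}$, $G_0^k$, $\{c^{\mathcal{U},b,k}\}_{b=1,\ldots,B}$
Compute $G_0^{k+1}(\cdot)$ and $\alpha_{k+1}$ as follows.

$$G_0^{k+1}(\cdot) = \frac{1}{B} \sum_{b=1}^{B} \sum_{c \in c^{\mathcal{U},b,k}} \frac{n_{c,b,k}}{|\mathcal{U}|}\, G_0^k(\cdot \mid x^c) \tag{5}$$

$$\alpha_{k+1} = \arg\min_{\alpha \in \mathbb{R}} \left| \sum_{i=1}^{|\mathcal{U}|} \frac{\alpha}{\alpha + i - 1} - \frac{1}{B} \sum_{b=1}^{B} N_b \right| \tag{6}$$

where $n_{c,b,k}$ is the size of cluster $c \in c^{\mathcal{U},b,k}$, and $N_b$ is the
total number of (non-empty) clusters in $c^{\mathcal{U},b,k}$.
Output: $G_0^{k+1}$, $\alpha_{k+1}$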


Note that (5) simply corresponds to a kernel density estimator
based on the $B$ cluster samples obtained with prior distribution
parametrized by $G_0^k$ and $\alpha_k$, whereas (6) corresponds to
a maximum likelihood estimate (see nil), which sets $\alpha_{k+1}$ to
the value which is most likely to have resulted in the average
number of clusters obtained when sampling from the model
with parameters $G_0^k$ and $\alpha_k$.
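
The update of the concentration parameter amounts to matching the expected number of clusters of a DP with $|\mathcal{U}|$ users, $\sum_{i=1}^{|\mathcal{U}|} \alpha / (\alpha + i - 1)$, to the average number of clusters observed in the samples. Since this expectation is increasing in $\alpha$, the match can be found by bisection. The sketch below is one possible way to do this; the search bounds and function names are assumptions, and the target is assumed to lie strictly between 1 and the number of users.

```python
def expected_num_clusters(alpha, num_users):
    """Expected number of clusters under a DP with concentration alpha."""
    return sum(alpha / (alpha + i) for i in range(num_users))


def fit_alpha(avg_clusters, num_users, lo=1e-6, hi=1e6, iters=200):
    """Find alpha whose expected cluster count matches `avg_clusters`.

    Uses bisection on a logarithmic scale between the bounds lo and hi.
    """
    for _ in range(iters):
        mid = (lo * hi) ** 0.5
        if expected_num_clusters(mid, num_users) < avg_clusters:
            lo = mid
        else:
            hi = mid
    return (lo * hi) ** 0.5
```

For instance, with three users and an average of 11/6 clusters per sample, the fitted value is close to 1, since 1 + 1/2 + 1/3 = 11/6.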

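3) Computation of $\hat{\theta}^u$: As mentioned earlier, $\hat{\theta}^u$ is an
estimator of $\mathbb{E}_g[\theta^u \mid x^{\mathcal{U}}]$, where $g$ is parameterized by $G_0^K$ and
$\alpha_K$, and is used for our prediction of user-$u$’s mobility. $\hat{\theta}^u$ is
just the empirical average of $\theta^c$ for clusters $c$ to which user $u$
is associated in the $B$ last samples generated in CAMP, i.e.,

$$\hat{\theta}^u = \frac{1}{B} \sum_{b=1}^{B} \mathbb{E}_g[\theta^{c^{u,b,K}} \mid x^{c^{u,b,K}}] \tag{7}$$

$$= \frac{1}{B} \sum_{b=1}^{B} \frac{\int_\Theta \theta\, P_\theta(x^{c^{u,b,K}})\, G_0^K(d\theta)}{\int_\Theta P_\theta(x^{c^{u,b,K}})\, G_0^K(d\theta)}. \tag{8}$$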

Note that in view of the law of large numbers, when $B$
grows large, $\hat{\theta}^u$ converges to $\mathbb{E}_g[\theta^u \mid x^{\mathcal{U}}]$. The predictions for
user $u$ are made by first computing an estimated transition
kernel $\hat{\theta}^u$ according to (8). We derive an explicit expression
of $\hat{\theta}^u$ that does not depend on $G_0^K$, but only on the data and the
samples generated in the CAMP algorithm. This expression,
given in the following lemma, will be useful to understand to
what extent the prediction of user-$u$’s mobility under CAMP
leverages observed trajectories of other users.
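
As a simplified illustration of this averaging, the sketch below computes the row of the estimated kernel for the current location of a user, assuming (for illustration only) that the base distribution is uniform, so that the posterior mean within a cluster has the closed form $(1 + n^c_{i,j}) / (L + \sum_{j'} n^c_{i,j'})$. Each sample is assumed to be a dictionary mapping users to cluster labels; these representations and the function names are hypothetical.

```python
from collections import Counter


def cluster_posterior_mean(trajs, i, locations):
    """Posterior mean of row i given the pooled trajectories of one cluster,
    assuming a uniform prior on the row."""
    counts = Counter()
    for traj in trajs:
        counts.update(b for a, b in zip(traj, traj[1:]) if a == i)
    n_out = sum(counts.values())
    return {j: (1 + counts[j]) / (len(locations) + n_out) for j in locations}


def estimate_row(user, samples, trajectories, locations):
    """Average, over the assignment samples, of the posterior mean row of the
    cluster containing `user`, evaluated at her current location."""
    i = trajectories[user][-1]
    row = {j: 0.0 for j in locations}
    for assignment in samples:
        members = [v for v in trajectories if assignment[v] == assignment[user]]
        post = cluster_posterior_mean([trajectories[v] for v in members], i, locations)
        for j in locations:
            row[j] += post[j] / len(samples)
    return row
```

The predicted next location is then the location with the largest entry in the returned row. Users who often share a cluster with the target user contribute their transitions more often, which is how their data influences the prediction.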

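Lemma 1 For any $i, j$, $\hat{\theta}^u_{i,j}$ is computed by a weighted sum of
all users’ empirical transition kernels ($n^v_{i,j}/n^v_i$, $v \in \mathcal{U}$), i.e.,

$$\hat{\theta}^u_{i,j} = \eta_i + \sum_{v \in \mathcal{U}} \gamma^v_i\, \frac{n^v_{i,j}}{n^v_i}, \tag{9}$$

where

$$\eta_i = \sum_{c_1..c_K} \frac{\omega_{c_1..c_K}\, \mathbb{1}(u \in c_K)}{|\mathcal{L}| + \sum_{k=1}^{K} n^{c_k}_i}, \qquad \gamma^v_i = \sum_{c_1..c_K} \frac{\omega_{c_1..c_K}\, \mathbb{1}(u \in c_K)\, n^v_i \sum_{k=1}^{K} \mathbb{1}(v \in c_k)}{|\mathcal{L}| + \sum_{k=1}^{K} n^{c_k}_i}.$$

The sum $\sum_{c_1..c_K}$ stands for $\sum_{c_1 \in C_1} \cdots \sum_{c_K \in C_K}$, and $C_k$ is the
set of every cluster sampled at the $k$-th iteration (i.e., $C_k =
\{c : \sum_{b=1}^{B} \mathbb{1}(c \in c^{\mathcal{U},b,k}) > 0\}$). The weights $\omega_{c_1..c_K}$ are functions of
$B|\mathcal{U}|$, of the counts $N^c_k$, and of the quantities $\xi_{c_1..c_K}$, with

$$\xi_{c_1..c_K} = \prod_{i \in \mathcal{L}} \frac{\prod_{j \in \mathcal{L}} \Gamma\left(1 + \sum_{k=1}^{K} n^{c_k}_{i,j}\right)}{\Gamma\left(|\mathcal{L}| + \sum_{k=1}^{K} n^{c_k}_i\right)},$$

where $n^c_{i,j} = \sum_{u \in c} n^u_{i,j}$, $n^c_i = \sum_{j \in \mathcal{L}} n^c_{i,j}$, and $N^c_k =
\sum_{b=1}^{B} \sum_{u \in \mathcal{U}} \mathbb{1}(c^{u,b,k} = c)$.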

Proof Refer to Appendix. □ 

When the current location $i$ is fixed, the first term in the
r.h.s. of (9) is constant over all users. The second term can
be interpreted as a weighted sum of the empirical transition
kernels of all users (i.e., $n^v_{i,j}/n^v_i$, $\forall v \in \mathcal{U}$). The weight of user
$v$ ($\gamma^v_i$ in (9)) quantifies how much we account for user-$v$’s
trajectory in the prediction for user $u$ at the current location
$i$, and can be seen as a notion of similarity between $v$ and
$u$. Indeed, as the number of sampled clusters in which both $u$
and $v$ are involved increases, $\gamma^v_i$ in (9) increases accordingly.
Also, if $v$ has relatively high $n^v_i$ compared to other users (i.e., $v$
has accumulated more observations at the location $i$ than other
users), a higher weight is assigned to $v$.

4) Estimating the Staying-times: Next we provide a way of
estimating how long user $u$ will stay at her current location
$i$. We may perform such estimation when the available data
include the time users stay at the various locations. Typically,
existing spatio-temporal predictors predict the staying time
at the current location $x^u_{n^u}$ by computing the average a or the
$p$-quantile a of user $u$’s staying times observed at her previous
visits to $x^u_{n^u}$. On the other hand, CAMP additionally exploits
other users’ staying time observations using the weight $\gamma^v_i$.
More precisely, the staying time of user $u$ at location $x^u_{n^u}$
(denoted by $\hat{s}^u_{n^u}$) is estimated by

$$\hat{s}^u_{n^u} = \frac{1}{z} \sum_{v \in \mathcal{U}} \frac{\gamma^v_i}{n^v_i} \sum_{t : x^v_t = i} s^v_t, \quad \text{where } i = x^u_{n^u}. \tag{10}$$

$z$ in (10) is a normalization constant to make the sum of
weights over all users equal to 1. The estimate $\hat{s}^u_{n^u}$ is a
heuristic, for $\gamma^v_i$ is actually obtained by clustering users based on
their location trajectories $x^{\mathcal{U}}$, rather than their staying times.
This heuristic estimate actually performs well, as empirically
shown in Section VI-B4.
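
One way to read this weighted estimate is as a weighted average of per-user mean staying times at the location of interest. The sketch below follows that reading; it assumes that the per-user weights are given as a dictionary and that users without any observation at the location are skipped before normalizing, both of which are illustrative choices rather than details taken from the paper.

```python
def estimate_staying_time(location, weights, trajectories, staying_times):
    """Weighted average of the users' mean staying times at `location`.

    weights[v] is the (unnormalized) weight of user v; trajectories[v] and
    staying_times[v] are aligned lists of visited locations and durations.
    """
    total = 0.0
    norm = 0.0
    for v, w in weights.items():
        stays = [s for x, s in zip(trajectories[v], staying_times[v]) if x == location]
        if stays and w > 0:
            total += w * sum(stays) / len(stays)
            norm += w
    return total / norm if norm > 0 else None
```

Users with a larger weight pull the estimate towards their own typical staying time, which is useful when the target user has few or no past visits to the location.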

V. Consistency of CAMP Predictor 

In this section, we analyze to what extent $\mathbb{E}_g[\theta^u \mid x^{\mathcal{U}}]$ (that
is well approximated, when $B$ is large, by $\hat{\theta}^u$ derived in the
CAMP algorithm) is close to $\mathbb{E}[\theta^u \mid x^u]$, the expectation under
the true prior $\mu$. We are mainly interested in the regime where
the user population $\mathcal{U}$ becomes large, while the number of
observations $n^u$ for each user remains bounded. This regime
is motivated by the fact that it is often impractical to gather long
trajectories for a given user, while the user population available
may on the contrary be very large. For the sake of the analysis,
we assume that the length $n^u$ of user-$u$’s observed trajectory
is a random variable with distribution $p \in \mathcal{P}(\mathbb{N})$, and that
the lengths of trajectories are independent across users. We
further assume that the length is upper bounded by $\bar{n}$, i.e.,
$\bar{n} = \max\{n : p(n) > 0\} < \infty$.

Since the length of trajectories is bounded, we cannot ensure that
$|E_g[\theta^u \mid x^U] - E[\theta^u \mid x^u]|$ is arbitrarily small.
Indeed, if for example users’ trajectories are of length 2 only, we
cannot group users into clusters, and in turn, we can only get a precise
estimate of the transition kernels averaged over all users. In
particular, we cannot hope to estimate $E[\theta^u \mid x^u]$ for each
user $u$. Next we formalize this observation. We denote by
$\mathcal{H}_{\bar n} \subset \mathcal{L}^{\bar n}$ the set of possible
trajectories of length at most $\bar n$. With finite-length observed
trajectories, there are distributions $\nu \in \mathcal{P}(\Theta)$ that
cannot be distinguished from the true prior $\mu$ by just observing
users’ trajectories, i.e., these distributions induce the same law on
the observed trajectories as $\mu$: $P_\nu = P$ on $\mathcal{H}_{\bar n}$
(here $P_\nu$ denotes the probability measure induced under $\nu$, and
recall that $P$ is the probability measure induced by $\mu$). We prove
that, when the number of observed users grows large,
$|E_g[\theta^u \mid x^U] - E[\theta^u \mid x^u]|$ is upper-bounded by the
performance provided by a distribution $\nu$ indistinguishable from
$\mu$, which expresses the consistency of our inference framework.
Before we state our result, we introduce the following two notions:
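KL $\epsilon$-neighborhood: the Kullback-Leibler $\epsilon$-neighborhood
$K_{\epsilon,\bar n}(\mu)$ of a distribution $\mu \in \mathcal{P}(\Theta)$
with respect to $\mathcal{H}_{\bar n}$ is defined as the following set of
distributions:

$$K_{\epsilon,\bar n}(\mu) = \{\nu \in \mathcal{P}(\Theta) : \mathrm{KL}_{\bar n}(\mu, \nu) < \epsilon\},$$

where $\mathrm{KL}_{\bar n}(\mu, \nu) = \sum_{x \in \mathcal{H}_{\bar n}} P(x) \log \frac{P(x)}{P_\nu(x)}$.

KL support: The distribution $\mu$ is in the Kullback-Leibler support of
a distribution $g \in \mathcal{P}(\mathcal{P}(\Theta))$ with respect to
$\mathcal{H}_{\bar n}$ if $g(K_{\epsilon,\bar n}(\mu)) > 0$ for all
$\epsilon > 0$.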
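
The indistinguishability discussed above can be checked numerically on a toy case. The sketch below is only an illustration under stated assumptions: it uses two locations, two made-up transition kernels (`THETA_A`, `THETA_B`), and an initial-location law (`INITIAL`) that does not depend on the kernel; none of these values come from the paper. It compares a two-cluster prior with a prior concentrated on the average kernel, and computes the KL divergence between the laws they induce on trajectories of a fixed length.

```python
import itertools
import math

# Hypothetical two-location example; all kernels and weights are made up.
THETA_A = ((0.9, 0.1), (0.2, 0.8))
THETA_B = ((0.5, 0.5), (0.6, 0.4))
THETA_MEAN = tuple(
    tuple(0.5 * a + 0.5 * b for a, b in zip(row_a, row_b))
    for row_a, row_b in zip(THETA_A, THETA_B)
)
INITIAL = (0.5, 0.5)  # assumed initial-location law, independent of the kernel

# A prior is represented as a finite list of (probability, kernel) atoms.
PRIOR_MIXTURE = [(0.5, THETA_A), (0.5, THETA_B)]  # two user clusters
PRIOR_POINT = [(1.0, THETA_MEAN)]                 # one "average" cluster


def trajectory_prob(prior, x):
    """Probability of trajectory x when the kernel is drawn from `prior`."""
    total = 0.0
    for weight, theta in prior:
        p = INITIAL[x[0]]
        for i, j in zip(x, x[1:]):
            p *= theta[i][j]
        total += weight * p
    return total


def kl_on_length(prior_p, prior_q, length):
    """KL divergence between the laws induced on trajectories of one length."""
    kl = 0.0
    for x in itertools.product(range(2), repeat=length):
        p = trajectory_prob(prior_p, x)
        q = trajectory_prob(prior_q, x)
        if p > 0.0:
            kl += p * math.log(p / q)
    return kl


kl_len2 = kl_on_length(PRIOR_MIXTURE, PRIOR_POINT, 2)
kl_len3 = kl_on_length(PRIOR_MIXTURE, PRIOR_POINT, 3)
print(f"KL on length-2 trajectories: {kl_len2:.6f}")
print(f"KL on length-3 trajectories: {kl_len3:.6f}")
```

With trajectories of length 2, each trajectory probability depends on the prior only through the mean kernel, so the two priors induce the same law and the divergence vanishes up to rounding. With length 3, products of two transition probabilities appear and the divergence becomes strictly positive, so the two priors can be told apart.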

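Theorem 2. If $\mu \in \mathcal{P}(\Theta)$ is in the KL-support of $g$
with respect to $\mathcal{H}_{\bar n}$, then we have, $\mu$-almost surely,
for any $i, j \in \mathcal{L}$,

$$\lim_{|U| \to \infty} \big|E_g[\theta^u_{ij} \mid X^U] - E[\theta^u_{ij} \mid X^u]\big| \le \sup_{\nu \in \mathcal{P}(\Theta):\, P_\nu = P \text{ on } \mathcal{H}_{\bar n}} \big|E_\nu[\theta^u_{ij} \mid X^u] - E[\theta^u_{ij} \mid X^u]\big|. \tag{11}$$

Proof. Refer to Appendix. □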

The r.h.s. of (11) captures the performance of an algorithm that would
perfectly estimate $E_\nu[\theta^u \mid X^u]$ for the worst distribution
$\nu$ that agrees with the true prior $\mu$ on $\mathcal{H}_{\bar n}$.
Note that in our framework, the prior $g \in \mathcal{P}(\mathcal{P}(\Theta))$
we use is a DP mixture $DP(G_0, \alpha)$, with a base measure
$G_0 \in \mathcal{P}(\Theta)$ having full support $\Theta$. Therefore,
the KL-support of $g$ is here the whole space $\mathcal{P}(\Theta)$; it
thus contains $\mu$.

As far as we are aware, Theorem 2 presents the first performance result
on inference algorithms using DP mixture models with indirect
observations. By indirect observations, we mean that the kernels
$(\theta^u)_{u \in U}$ cannot be observed directly, but are revealed only
through the trajectories $x^u$. Most existing analyses do not apply in
our setting, as they aim at identifying conditions on the Bayesian prior
$g$ and on the true distribution $\mu$ under which the Bayesian posterior
$g \mid \theta^U$ converges (either weakly or in $L_1$-norm) to $\mu$ in
the limit of large population size. Hence, existing analyses are
concerned with direct observations of the kernels $(\theta^u)_{u \in U}$.

VI. Empirical Evaluation of CAMP 
A. Mobility Traces 

We evaluate the performance of the CAMP predictor using two sets of
mobility traces, collected on a Wi-Fi network and on a cellular network,
respectively.

Wi-Fi traces. We use a dataset in which the mobility of 62 users was
collected for three months in Wi-Fi networks, mainly around a campus in
South Korea. The smartphone of each user periodically scans its radio
environment and gets a list of MAC addresses of available access points
(APs). To map these lists of APs collected over time to a set of
locations, we compute the Jaccard index⁴ between two lists of APs
scanned at different times. If two lists of APs have a Jaccard index
higher than 0.5, they are considered to correspond to the same
geographical location. From the constructed set of locations, we then
construct the trajectories of the various users.

⁴ The Jaccard index between two lists A and B is defined as $|A \cap B| / |A \cup B|$.

Fig. 1. Similarities between pairs of users: (a) Wi-Fi traces, (b) ISP
traces. For the ISP traces we restrict the plot to 100 randomly selected
users.

ISP traces [22]. We also use the call detail record (CDR) dataset
provided by Orange, in which the mobility of 50000 subscribers in Senegal
is measured over two weeks. We use the SET2 data [22], where the mobility
of a given user is reported as a sequence of base station (BS) ids and
time stamps. Each record is obtained only when the user communicates with
a base station (e.g., phone call, text message).

In each dataset, we first restrict our attention to a subset
$\mathcal{L}$ of frequently visited locations. We select the 116 and 80
most visited locations in the Wi-Fi and ISP traces, respectively. We then
re-construct users’ trajectories by removing locations not in
$\mathcal{L}$. For the ISP dataset, we extract 200 users (randomly chosen
among users who visited at least 10 of the locations in $\mathcal{L}$).
From the re-constructed trajectories, we observe a total number of
transitions from one location to another equal to 8194 and 13453 for the
Wi-Fi and ISP datasets, respectively.
Users’ similarity. Before actually evaluating the performance of the
various prediction algorithms, we assess whether users exhibit similar
mobility patterns that could in turn be exploited in our predictions.
Here, we test the similarity of pairs of users only. More precisely, we
wish to know whether the observed trajectory of user $v$ could be
aggregated to that of user $u$ to improve the prediction of user-u’s
mobility. To this aim, we use the concept of mutual prediction [23] as
follows.

We first define the empirical accuracy of an estimator $\theta$ of
user-u’s transition kernel:

$$AC^u(\theta) = \frac{1}{n^u - 1} \sum_{t=2}^{n^u} 1\Big(x^u_t = \arg\max_{j} \theta_{x^u_{t-1}, j}\Big). \tag{12}$$

Let $\theta^{u*}$ be the maximum likelihood estimator of $\theta^u$ given
$x^u$ (i.e., $\theta^{u*}_{ij} = n^u_{ij} / \sum_{j' \in \mathcal{L}} n^u_{ij'}$,
where $n^u_{ij}$ is the number of transitions from $i$ to $j$ observed in
$x^u$). Intuitively, user-v’s trajectory is useful to predict the
mobility of user $u$ if $\theta^{v*}$ has a high empirical accuracy for
user $u$, i.e., if $AC^u(\theta^{v*})$ is high. We hence define the
similarity $sim(u, v)$ of users $u$ and $v$ as
$sim(u, v) = AC^u(\theta^{v*}) / AC^u(\theta^{u*})$. Note that this
notion of similarity is not symmetric (in general
$sim(u, v) \neq sim(v, u)$), and that it always takes its value between
0 and 1.
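
The similarity defined above can be sketched in a few lines. This is a minimal illustration, not the authors’ implementation: the function names are hypothetical, trajectories are plain lists of location labels, and a location that the estimator has never seen is counted as a wrong prediction (an assumption, since that case is not specified in the text).

```python
from collections import Counter


def mle_kernel(trajectory):
    """Empirical transition frequencies: kernel[i][j] = n_ij / sum_j' n_ij'."""
    counts = Counter(zip(trajectory, trajectory[1:]))
    row_totals = Counter()
    for (i, _j), n in counts.items():
        row_totals[i] += n
    kernel = {}
    for (i, j), n in counts.items():
        kernel.setdefault(i, {})[j] = n / row_totals[i]
    return kernel


def empirical_accuracy(kernel, trajectory):
    """Fraction of the transitions of `trajectory` predicted by `kernel`."""
    hits = 0
    for prev, nxt in zip(trajectory, trajectory[1:]):
        row = kernel.get(prev)
        # Assumption: a location unknown to the estimator yields a miss.
        if row and max(row, key=row.get) == nxt:
            hits += 1
    return hits / (len(trajectory) - 1)


def similarity(traj_u, traj_v):
    """sim(u, v): accuracy of v's estimator on u, relative to u's own."""
    own = empirical_accuracy(mle_kernel(traj_u), traj_u)
    cross = empirical_accuracy(mle_kernel(traj_v), traj_u)
    return cross / own


# Made-up trajectories, for illustration only.
traj_u = ["home", "work", "home", "work", "home", "gym", "home", "work"]
traj_v = ["home", "work", "home", "work", "cafe", "work", "home"]
traj_w = ["gym", "cafe", "gym", "cafe"]
print(similarity(traj_u, traj_v), similarity(traj_u, traj_w))
```

Because the argmax prediction built from a user’s own empirical frequencies maximizes the number of correct predictions on that user’s trajectory, the ratio cannot exceed 1, which matches the range stated above.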

Fig. 1(a) and (b) present the similarity between the 62 users in the
Wi-Fi traces and between 100 users in the ISP traces. To provide
meaningful plots, we have ordered users so that pairs of users with high
similarity are neighbours (to this aim, we have run a spectral clustering
algorithm and re-grouped users in the identified clusters). From these
plots, the similarity of users is apparent; however, we also clearly
observe that perfect clusters (in which users’ patterns are exactly the
same) do not really exist. From the datasets, we observe that 1.65% and
5% of all possible user pairs have similarity higher than 0.5 for the
Wi-Fi and ISP traces, respectively. We also computed the number of users
having at least one user with whom the similarity is higher than 0.5. In
the Wi-Fi traces, we found 19 (out of 62) such users, whereas in the ISP
traces there are 173 (out of 200) such users. These numbers are high, and
justify the design of cluster-aided predictors.

B. Prediction Accuracy 

1) Tested Predictors: We assess the performance of six types of
predictors: the order-1 Markov predictor (Markov), the order-2 Markov
predictor with fallback (Markov-O(2)), AGG, CAMP, CAMP$^c$ and AGG$^c$.
Before describing each predictor, we briefly introduce some notation
regarding the training data available at a given time. The time stamp of
the arrival at the $t$-th location on user-u’s trajectory is denoted by
$d^u_t \in \mathbb{R}$, and $n^u(d)$ is the length of user-u’s trajectory
collected before time $d$ (i.e., $n^u(d) = \max\{s : d^u_s < d\}$). The
collection of users’ trajectories available for a prediction at time $d$
is denoted by $x^{U,d}$ (i.e., $x^{U,d} = (x^{v,d})_{v \in U}$, where
$x^{v,d} = (x^v_1, \ldots, x^v_{n^v(d)})$). The prediction for $x^u_t$ is
denoted by $\hat x^u_t$.

In order to derive an estimate of the $t$-th location $x^u_t$ of user
$u$, the Markov predictors first estimate $\theta^u$ based on user-u’s
trajectory only, i.e., based on $x^{u,d^u_t}$. In contrast, the AGG and
CAMP algorithms exploit the data available on all users, $x^{U,d^u_t}$,
to estimate $\theta^u$. The AGG algorithm tries in a very naive way to
exploit users’ similarities: it considers that all users have the same
transition kernel (as if there were a single cluster only), and thus uses
all trajectories (in the same way) to estimate $\theta^u$. CAMP$^c$
(resp. AGG$^c$) differs from CAMP (resp. AGG) in that its prediction at
time $d$ for user $u$ uses other users’ complete trajectories (i.e.,
$x^{U \setminus u}$). This corresponds to a case where user $u$ starts
moving along her trajectory after other users have gathered sufficiently
long trajectories.
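TABLE I
Order-1 predictors. For all predictors, the prediction is
$\hat x^u_t = \arg\max_j \hat\theta^{u,d^u_t}_{x^u_{t-1}, j}$.

Predictor    Estimate $\hat\theta^{u,d^u_t}$
Markov       $\arg\max_\theta P_\theta(x^{u,d^u_t})$
AGG          $\arg\max_\theta P_\theta(x^{U,d^u_t})$
CAMP         $\approx E_g[\theta^u \mid x^{U,d^u_t}]$
AGG$^c$      $\arg\max_\theta P_\theta(x^{U \setminus u}, x^{u,d^u_t})$
CAMP$^c$     $\approx E_g[\theta^u \mid x^{U \setminus u}, x^{u,d^u_t}]$
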
Under all algorithms, the estimated $\theta^u$ is denoted by
$\hat\theta^{u,d^u_t}$. Finally, Markov-O(2) assumes that users’
trajectories are order-2 Markov chains, and for the locations where the
corresponding order-2 transitions are not observed, Markov-O(2) falls
back to the Markov predictor. The description of the various predictors
is summarized in Table I.

The parameters $B$, $K$ and $M$ for CAMP and CAMP$^c$ are set to 8, 3 and
30, respectively.
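
To make the difference between the baseline predictors concrete, the following sketch implements the Markov, Markov-O(2) and AGG predictors described above using simple transition counts. It is an illustration only: the function names are hypothetical, trajectories are lists of location labels, and a predictor returns `None` when it has never observed the current context (how the evaluated implementations handle that case is not stated in the text).

```python
from collections import Counter, defaultdict


def transition_counts(trajectories, order=1):
    """Count (context -> next location) transitions over some trajectories."""
    counts = defaultdict(Counter)
    for traj in trajectories:
        for t in range(order, len(traj)):
            counts[tuple(traj[t - order:t])][traj[t]] += 1
    return counts


def predict_next(counts, context):
    """Most frequent successor of `context`, or None if it was never seen."""
    row = counts.get(tuple(context))
    return row.most_common(1)[0][0] if row else None


def markov_predict(history):
    """Order-1 Markov predictor: uses the user's own history only."""
    return predict_next(transition_counts([history]), history[-1:])


def markov_o2_predict(history):
    """Order-2 Markov predictor, falling back to order 1 if needed."""
    guess = predict_next(transition_counts([history], order=2), history[-2:])
    return guess if guess is not None else markov_predict(history)


def agg_predict(history, other_histories):
    """AGG: pools all users' histories as if they shared a single kernel."""
    pooled = transition_counts([history] + list(other_histories))
    return predict_next(pooled, history[-1:])


# Made-up data: a new user with a short history, and two other users.
new_user = ["home", "work"]
others = [["work", "gym", "home", "work", "gym"], ["work", "gym", "work", "gym"]]
print(markov_predict(new_user), agg_predict(new_user, others))
```

In this toy case the individual predictor has no observation for the current location, while the pooled predictor can still answer. CAMP sits between these two extremes by weighting other users’ data according to how similar they appear to be.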

2) Results: We assess the performance of the various algorithms using
two main types of metrics. The first metric, referred to as the
Cumulative Accurate Prediction Ratio (CAPR), is defined as the fraction
of accurate predictions for all users up to time $d$:

$$CAPR_{time}(d) = \frac{\sum_{u \in U} \sum_{t=2}^{n^u(d)} 1(\hat x^u_t = x^u_t)}{\sum_{u \in U} (n^u(d) - 1)}.$$


We also introduce a similar metric that captures the cumulative accuracy
of predictions after observing $t$ different locations on users’
trajectories:

$$CAPR(t) = \frac{1}{(t - 1) \sum_{u \in U} 1(n^u \ge t)} \sum_{u \in U:\, n^u \ge t} \; \sum_{s=2}^{t} 1(\hat x^u_s = x^u_s).$$


The second type of metrics concerns the instantaneous accuracy of the
predictions. The Instantaneous Accurate Prediction Ratio (IAPR) after
observing $t$ different locations on users’ trajectories is defined as
follows:

$$IAPR(t) = \frac{1}{\sum_{u \in U} 1(n^u \ge t)} \sum_{u \in U:\, n^u \ge t} 1(\hat x^u_t = x^u_t).$$


Fig. 2(a)-(b) present $CAPR_{time}$ as a function of time $d$ for the
various algorithms and for the two mobility traces. CAMP outperforms all
other algorithms at any time. The improvement over Markov and
Markov-O(2) can be as high as 65%. This illustrates the performance gain
that can be achieved when exploiting users’ similarities. Note that
Markov-O(2) does not outperform Markov, as was also observed in earlier
work. In the following, we only evaluate the performance of the Markov
predictor, and do not report that of its order-2 equivalent.

In Fig. 2(c)-(f), we plot the CAPR and IAPR as a function of the length
$t$ of the observed trajectory. In Fig. 2(c) and (d), when the collected
trajectory is short (i.e., $t = 10$), CAMP$^c$ and CAMP outperform Markov
by 64% and 40%, respectively. Regarding the IAPR in the Wi-Fi traces,
Fig. 2(e) shows that CAMP and CAMP$^c$ provide much better predictions
than Markov when the length of the trajectory is less than 140. Once
sufficient training data is collected, they yield comparable IAPR. In
Fig. 2(f), for the ISP traces, the IAPR under CAMP and Markov become
similar sooner, for trajectories of length greater than 20 only.

Fig. 3. Number of u-similar users, averaged over all users u, vs. time
(days d): (a) Wi-Fi traces, (b) ISP traces.

In Fig. 2(g) and (h), we evaluate the CAPR and IAPR averaged only over
users having at least one user with whom the similarity is higher than
0.5 (see §VI-A). These users are referred to as Mobility Friendly (MF)
users. In Fig. 2(g), we observe that for MF users, the gain of CAMP$^c$
and CAMP becomes really significant: when $t = 10$, the CAPR of CAMP$^c$
and CAMP outperform that of Markov by 102% and 65%, respectively. Also
note that CAMP$^c$ becomes significantly better than CAMP for MF users.
This is explained by the fact that we can predict the mobility of MF
users much more accurately if we have a long history of the mobility of
the users they are similar to. The performance for MF users in the ISP
traces is not presented, because there, most users (i.e., 86%) are
already MF users.

3) Exploiting Similarities in CAMP: Recall that the weight given to the
empirical transition kernel of user $v$ when computing $\hat\theta^u$ in
(9) quantifies to what extent the observed trajectory of user $v$ is
taken into account in the estimate $\hat\theta^u$. When summing this
weight over all locations $i$, we get an aggregate indicator capturing
how $v$ impacts the prediction for user-u’s mobility. To understand how
many users actually impact the prediction for user $u$ in CAMP, we may
look at the cardinality of the set of users $v$ whose normalized
aggregate indicator exceeds a given threshold, where the normalization
constant $z$ makes the sum of aggregate indicators over all users equal
to 1. This set is called the set of u-similar users.

In Fig. 3, we plot the number of u-similar users, averaged over all users
$u$, as a function of the length of trajectories (in days $d$). On the
first day, the average numbers under CAMP are 7 and 110 in the Wi-Fi and
ISP traces, respectively, which means that CAMP aggressively uses the
trajectories of all users for its prediction. When the length of the
trajectories increases, the average size decreases to 1.5 after one month
in the Wi-Fi traces and to 2.2 after two weeks in the ISP traces. In
other words, as data is accumulated, CAMP no longer uses the trajectories
of many users for its prediction. This illustrates the adaptive nature of
CAMP, which only exploits similarities among users if this is needed. In
the case of CAMP$^c$, we observe a faster decrease with time of the
average number of u-similar users, which means that CAMP$^c$ tends to
utilize other users’ data more selectively, even at the beginning. This
explains why CAMP$^c$ performs better than CAMP in Fig. 2.

Fig. 2. Performance of various predictors. Recoverable panel titles:
(e) IAPR, Wi-Fi traces; (f) IAPR, ISP traces; (g) CAPR, Wi-Fi traces
(MF); (h) IAPR, Wi-Fi traces (MF). Horizontal axis: length of trajectory
(t).

Fig. 4. Estimation error of staying time in Wi-Fi trace: (a) distribution
of estimation errors, (b) when Markov is unavailable. Horizontal axis:
estimation error (hours). The dashed lines in (a) indicate the fraction
of cases where the related training data was collected.

4) Error of Staying Time Estimation: In our scenario, when a user $u$
arrives at her $t$-th location $x^u_t$, a predictor estimates the staying
time $s^u_t$ with the available data. The Markov predictor (resp. AGG)
computes the average of the staying times of user $u$ (resp. all users)
that have been measured at location $x^u_t$ until $d^u_t$. CAMP predicts
$s^u_t$ by computing equation (10) with the observed data of all users.
The performance metric for each user $u$ at her $t$-th location is the
difference between the estimated and actual staying times (i.e.,
$|\hat s^u_t - s^u_t|$). We call this the estimation error. We test the
estimation error with the Wi-Fi trace only, because we cannot precisely
observe staying times in the ISP trace, in which a location is recorded
not periodically, but only when users happen to communicate with base
stations.

Fig. 4(a) plots the CDFs of the estimation errors, over every user $u$
and index $t$, obtained by the tested predictors. CAMP provides a lower
estimation error than Markov and AGG. The median error of CAMP is lower
than those of Markov and AGG by 35% and 28%, respectively. For 18% of all
instances (marked as “Estimation failure”), Markov could not provide an
estimate, because the user had not yet recorded a staying time at the
current location. In those cases, however, AGG and CAMP are still able to
estimate the staying time by using other users’ observations. In
Fig. 4(b), we further test the estimation quality of AGG and CAMP when
Markov is unavailable due to a lack of individual training data. In that
case, 43% of the estimates provided by CAMP have an error of less than 30
minutes. The median estimation error of CAMP is 13.4% lower than that of
AGG, because CAMP selectively utilizes other users’ data.
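
The two averaging baselines for staying-time estimation can be sketched as below. This is an illustration under assumptions: visits are stored as hypothetical `(location, hours)` pairs, and a predictor returns `None` when it has no sample for the requested location, which corresponds to the “Estimation failure” case discussed above. CAMP’s own estimator is not reproduced here.

```python
from statistics import mean


def markov_staying_time(location, own_visits):
    """Average of the user's own past staying times at `location`."""
    samples = [hours for loc, hours in own_visits if loc == location]
    return mean(samples) if samples else None


def agg_staying_time(location, visits_by_user):
    """Average of all users' past staying times at `location`."""
    samples = [
        hours
        for visits in visits_by_user.values()
        for loc, hours in visits
        if loc == location
    ]
    return mean(samples) if samples else None


def estimation_error(estimate, actual):
    """Absolute difference between estimated and actual staying time."""
    return abs(estimate - actual)


# Made-up visit logs, as (location, staying time in hours) pairs.
visits = {
    "u": [("lab", 3.0), ("lab", 5.0)],
    "v": [("cafe", 1.0), ("lab", 4.0), ("cafe", 2.0)],
}
print(markov_staying_time("cafe", visits["u"]), agg_staying_time("cafe", visits))
```

Here user `u` has never stayed at the cafe, so the individual average is unavailable, while the pooled average can still be computed from user `v`’s visits.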


VII. Concluding Remarks 

In this paper, we have presented a cluster-aided inference method to
predict the mobility of users in wireless networks. This method
significantly departs from existing prediction techniques, as it aims at
exploiting similarities in the mobility patterns of the various users to
improve the prediction accuracy. The proposed algorithm, CAMP, relies on
Bayesian non-parametric estimation tools, and is robust and adaptive in
the sense that it exploits users’ mobility similarities only if the
latter really exist. We have shown that our Bayesian prediction framework
can asymptotically achieve the performance of an optimal predictor when
the user population grows large, and have presented extensive experiments
indicating that CAMP outperforms the existing prediction algorithms it
was compared against. Note also that CAMP can be implemented without
damaging users’ privacy (the data can be anonymized).
Many interesting questions remain about the design of CAMP. In
particular, we plan to investigate how to set its parameters ($B$, $K$,
and $M$) to achieve an appropriate trade-off between accuracy and
complexity. These parameters could also be modified in an online manner
while the algorithm is running, to adapt to the nature of the data. We
further plan to apply the techniques developed in this paper to other
kinds of mobility, e.g., we could investigate how users dynamically
browse the web, and use our framework to predict the next visited
webpage.


Appendix 

A. Proof of Lemma 1

Observe that in view of (5), we have:

$$G_0^{k+1}(d\theta) = \sum_{c} \omega_c^k P_\theta(x^c) G_0^k(d\theta), \tag{13}$$

where the sum is over all possible partitions of the set of users $U$ in
clusters and the weight $\omega_c^k$ is

$$\omega_c^k = \frac{n_c^k}{B |U| \int_\Theta P_\theta(x^c) G_0^k(d\theta)}, \tag{14}$$

with $n_c^k = \sum_{b=1}^{B} \sum_{u \in U} 1(c^{u,b,k} = c)$. Recursively
replacing $G_0^k$ in (13) with $G_0^{k-1}$ and putting
$G_0^1 = \mathrm{Uniform}(\Theta)$, we obtain another expression of
$G_0^K$ as

$$G_0^K(d\theta) = \sum_{c_1, \ldots, c_{K-1}} \prod_{k=1}^{K-1} \omega_{c_k}^k P_\theta(x^{c_k}) \, d\theta, \tag{15}$$

where the sum $\sum_{c_1, \ldots, c_{K-1}}$ stands for
$\sum_{c_1 \in C_1} \cdots \sum_{c_{K-1} \in C_{K-1}}$ and $C_k$ is the
set of every cluster sampled at the $k$-th iteration, i.e.,
$C_k = \{c : \sum_{b=1}^{B} \sum_{u \in U} 1(c^{u,b,k} = c) > 0\}$. We can
further obtain the recursive expression of the weights $\omega_c^K$ by
plugging (15) in (14):


*5 = 


,K 


b\u\ e e 


Ci,...,CK-l,C 




K -1 

n < 

k=l 


(16) 


' P e (x Ck )d6 


k=l..K 


n 

iGC 


U je c r(i + E fc =i..if n ij) 

r (l £ l + J2k=l..K n i k ) ’ 


(17) 

(18) 


where n c ig = 


u ec n tr < =_Eie.c' 


Then, using equations ( IT5] > and (fl7] >. we get an expres¬ 
sion for Ofi: In ©, replacing the denominator of each sample 
b with lu k u b and plugging into numerator, we arrive at 

nu _j_ u t“, b , K B\U\ 

R n K 2^ 


b= 1 


0ijP e (x cU ’ b ' K ) 


,CK- 1 

K-l 


l P e (x Ck )oo k Ck dO 


E 

Cl,...,CK 

:u£ck 


Cci,...,c 


1 + Ef=i 1 


h3 


\U\ 


\£\ + EE n i k 


n 




(19) 


.K 


Rearranging ( IT9l ). we arrive at 


B. Proof of Theorem 2

The proof of Theorem 2 relies on the following two lemmas.

Lemma 3. If $\mu \in \mathcal{P}(\Theta)$ is in the KL-support of $g$
with respect to $\mathcal{H}_{\bar n}$, then
$g(K_{\epsilon,\bar n}(\mu) \mid X^U) \to 1$ as $|U| \to \infty$ for all
$\epsilon > 0$, $\mu$-almost surely.

The above lemma is a perfect analog of a similar statement for Bayesian
consistency with direct observations (see [20], Theorem 6.1 and its
corollary). The proof also goes through essentially in the same way;
therefore, we do not provide it here. This first lemma states that the
set $K^c_{\epsilon,\bar n}(\mu)$, i.e., the set of distributions $\nu$
that do not agree with the true prior $\mu$ on $\mathcal{H}_{\bar n}$
according to the KL divergence $\mathrm{KL}_{\bar n}(\mu, \nu)$ w.r.t.
$\mathcal{H}_{\bar n}$, has a vanishing mass under the posterior
distribution $g \mid X^U$, $\mu$-a.s. However, this does not guarantee
that the set of distributions $\nu$ with
$0 < \mathrm{KL}_{\bar n}(\mu, \nu) < \epsilon$ will have a negligible
impact on the estimates $E_g[\theta^u \mid X^U]$. Indeed, for this we
need continuity with respect to the KL divergence over
$\mathcal{H}_{\bar n}$, which the next lemma provides.

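Lemma 4. Under the assumptions of Lemma 3, for any bounded continuous
$f : \Theta \to \mathbb{R}$,

$$\lim_{\epsilon \to 0} \sup_{\nu \in K_{\epsilon,\bar n}(\mu)} |E_\nu[f] - E[f]| = \sup_{\nu \in \mathcal{P}(\Theta):\, P_\nu = P_\mu \text{ on } \mathcal{H}_{\bar n}} |E_\nu[f] - E[f]|.$$

Proof. Let $\rho$ be the metric on $\Theta$. We use the associated
Wasserstein metric $d_\rho$ on $\mathcal{P}(\Theta)$:

$$d_\rho(\mu, \nu) = \inf_{\pi \in \mathcal{P}(\Theta^2):\, \pi_1 = \mu,\, \pi_2 = \nu} \left\{ \int \rho(\theta, \lambda) \, \pi(d\theta, d\lambda) \right\},$$

where $\pi_1$ and $\pi_2$ are the first and second marginals of $\pi$,
respectively. It is well-known (see [24]) that the space
$(\mathcal{P}(\Theta), d_\rho)$ is compact, complete and separable, as
$(\Theta, \rho)$ is.

Let $\delta > 0$ and let $(\epsilon_k)_{k \in \mathbb{N}}$ be a sequence
converging to 0. For all $k \in \mathbb{N}$, let
$\nu_k \in K_{\epsilon_k,\bar n}(\mu)$ be such that

$$|E_{\nu_k}[f] - E[f]| \ge \sup_{\nu \in \mathcal{P}(\Theta):\, \mathrm{KL}_{\bar n}(\mu, \nu) < \epsilon_k} |E_\nu[f] - E[f]| - \delta.$$

By compactness of $(\mathcal{P}(\Theta), d_\rho)$, there exists a
converging subsequence $(\nu_{i_k})$ of $(\nu_k)$, and a corresponding
subsequence $(\epsilon_{i_k})$ of $(\epsilon_k)$; let us call
$\nu_\infty \in \mathcal{P}(\Theta)$ its limit. Clearly, we have
$\mathrm{KL}_{\bar n}(\mu, \nu_\infty) = 0$. Because the Wasserstein
distance metricizes weak convergence (see Theorem 6.9 in [24]) and $f$ is
bounded and continuous, we have that
$\lim_{k \to \infty} E_{\nu_{i_k}}[f] = E_{\nu_\infty}[f]$. Thus,

$$\sup_{\nu \in \mathcal{P}(\Theta):\, P_\nu = P_\mu \text{ on } \mathcal{H}_{\bar n}} |E_\nu[f] - E[f]| \ge |E_{\nu_\infty}[f] - E[f]|$$
$$= \lim_{k \to \infty} |E_{\nu_{i_k}}[f] - E[f]| \ge \lim_{k \to \infty} \sup_{\nu \in K_{\epsilon_{i_k},\bar n}(\mu)} |E_\nu[f] - E[f]| - \delta$$
$$= \lim_{\epsilon \to 0} \sup_{\nu \in K_{\epsilon,\bar n}(\mu)} |E_\nu[f] - E[f]| - \delta,$$

where the last equality holds because the sequence of suprema is
decreasing. Letting $\delta \to 0$ completes the proof of this
inequality. The opposite inequality is obvious by the definition of
$K_{\epsilon,\bar n}(\mu)$. □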

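Proof of Theorem 2. For any bounded continuous
$f : \Theta \to \mathbb{R}$, we have

$$\big|E_g[f(\theta^u) \mid X^U] - E[f(\theta^u) \mid X^u]\big| \le 2 \|f\|_\infty \, g(K^c_{\epsilon,\bar n}(\mu) \mid X^U) + \int_{K_{\epsilon,\bar n}(\mu)} \big|E_\nu[f(\theta^u) \mid X^u] - E[f(\theta^u) \mid X^u]\big| \, dg(\nu \mid X^U).$$

According to Lemma 3, the first term in the r.h.s. goes to 0 as
$|U| \to \infty$, $\mu$-a.s. The second term can always be upper-bounded
by

$$\sup_{\nu \in K_{\epsilon,\bar n}(\mu)} \big|E_\nu[f(\theta^u) \mid X^u] - E[f(\theta^u) \mid X^u]\big|.$$

By Bayes’ theorem,

$$E_\nu[f(\theta^u) \mid X^u] = \frac{E_\nu[f(\theta^u) P_{\theta^u}(X^u)]}{P_\nu(X^u)}.$$

For any $x \in \mathcal{H}_{\bar n}$, Lemma 4 applied to the bounded
continuous function $\theta \mapsto P_\theta(x)$ yields

$$\lim_{\epsilon \to 0} \sup_{\nu \in K_{\epsilon,\bar n}(\mu)} |P_\nu(x) - P(x)| = 0.$$

Another application of Lemma 4 to
$\theta^u \mapsto \theta^u_{ij} P_{\theta^u}(X^u)$ completes the proof. □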

Note that we could have obtained a version of Theorem 2 giving a bound
on the error in the estimation of any bounded continuous function
$f(\theta^u)$ by simply using the function
$\theta^u \mapsto f(\theta^u) P_{\theta^u}(X^u)$ in the last line of the
above proof.